We outline our first steps towards marrying two new and emerging technologies: the Virtual Observatory (e.g., AstroGrid) and the computational grid. We discuss the construction of VOTechBroker, a modular software tool designed to abstract the tasks of submission and management of a large number of computational jobs to a distributed computer system. The broker will also interact with the AstroGrid workflow and MySpace environments. We present our planned usage of the VOTechBroker in computing a huge number of n-point correlation functions from the SDSS, as well as fitting over a million CMBfast models to the WMAP data.



1. Introduction 

Over a petabyte of raw astronomical data is expected 
to be collected in the next decade (see Szalay & Gray 
2001). This explosion of data also extends to the 
volume of parameters measured from these data in- 
cluding their errors, quality flags, weights and mask 
information. Furthermore, these massive datasets fa- 
cilitate more complex analyses, e.g. nonparametric 
statistics, which are computationally intensive. A 
key question therefore is: Can existing statistical 
software scale up to cope with such large datasets
and massive calculations? We address this question 
here. 

We focus here on two exciting new technologies, namely the Virtual Observatory (VO) and computational grids. For an excellent summary of existing statistical software packages in physics and astrophysics, we point the reader to Jim Linnemann's paper in these proceedings. We also direct the reader to the recent ADASS conference proceedings and the "Mining the Sky" proceedings (www.mpa-garching.mpg.de/cosmo/).



2. N-point Correlation Functions

As a case study of the types of massive calcula- 
tions planned for the next generation of astronom- 
ical surveys and analyses, we discuss here the galaxy 
n-point correlation functions. These have a long his- 
tory in cosmology and are used to statistically quan- 
tify the degree of spatial clustering of a set of data 
points (e.g. galaxies). There is a hierarchy of correlation functions, starting with the 2-point correlation function, which measures the joint probability of a data pair, as a function of their separation $r$, compared to a Poisson distribution, i.e., $dP_{12} = N^2\, dV_1\, dV_2\, (1 + \xi(r))$, where $dP_{12}$ is the joint probability of finding an object in each of the search volumes $dV_1$ and $dV_2$, $N$ is the space density of objects, and $\xi(r)$ is the 2-point correlation function, which is zero for a Poisson distribution. If $\xi(r)$ is positive, then the objects are more clustered on scales of $r$ than expected, and vice versa for negative values.

The next in the series is the 3-point correlation function, which is defined as $dP_{123} = N^3\, dV_1\, dV_2\, dV_3\, (1 + \xi_{12}(r_{12}) + \xi_{23}(r_{23}) + \xi_{13}(r_{13}) + \xi_{123}(r_{12}, r_{23}, r_{13}))$, where $\xi_{12}$, $\xi_{23}$, $\xi_{13}$ are the 2-point functions for the three sides ($r_{12}$, $r_{23}$, $r_{13}$) of the triangle and $\xi_{123}$ is the 3-point function. Likewise, one can define a 4-point, 5-point, etc., correlation function. The reader is referred to Peebles (1980) for a full discussion of these n-point correlation functions, including their importance to cosmology (see also the recent lecture notes of Szapudi 2005). We also refer the reader to Landy & Szalay (1993) and Szapudi & Szalay (1998) for a discussion of the practical details of computing the n-point functions.

Naively, the computation of the n-point correlation
functions scales as $O(R^n)$, where $R$ is the
number of data points in the sample. As one can
see, even with existing galaxy surveys from the 
Sloan Digital Sky Survey (SDSS), where R ~ 10^- 
10^, such correlation functions quickly become intractable to compute. In recent years, a number of more efficient algorithms have been developed to beat this naive scaling. For example, the International Computational Astrostatistics (inCA; www.incagroup.org) group has developed a new algorithm based on the use of the multi-resolutional KD-tree data structure (mrKDtrees). This software, known as npt, is publicly available (www.autonlab.org/autonweb/software/10378.html), and has been discussed previously in Gray et al. (2003), Nichol et al. (2001) and Moore et al. (2000). Briefly, mrKDtrees represent a condensed data structure in memory, which is used to efficiently answer as much of any data query as possible, i.e., by pruning the tree in memory. The key advance of our npt algorithm is the use of n trees in memory together to compute an n-point function. See also Alex Gray's contribution in this volume.
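
To make the idea of pruning concrete, the following is a minimal sketch of dual-tree pair counting in Python. It is not the npt code: the class `Node`, the functions `box_dist_bounds`, `dual_tree_count` and `brute_count`, and the `leaf_size` parameter are all names assumed here for illustration. Two kd-trees are walked together, and whole pairs of nodes are either discarded or counted in bulk whenever their bounding boxes show that every pair of points lies outside or inside the separation bin.

```python
import math
import random


class Node:
    """kd-tree node caching the bounding box and number of its points."""

    def __init__(self, points, leaf_size=8, depth=0):
        self.n = len(points)
        dim = len(points[0])
        self.lo = [min(p[k] for p in points) for k in range(dim)]
        self.hi = [max(p[k] for p in points) for k in range(dim)]
        self.points = points
        self.children = []
        if self.n > leaf_size:
            axis = depth % dim
            pts = sorted(points, key=lambda p: p[axis])
            mid = self.n // 2
            self.children = [Node(pts[:mid], leaf_size, depth + 1),
                             Node(pts[mid:], leaf_size, depth + 1)]


def box_dist_bounds(a, b):
    """Smallest and largest possible distance between a point in a and one in b."""
    dmin2 = dmax2 = 0.0
    for k in range(len(a.lo)):
        gap = max(a.lo[k] - b.hi[k], b.lo[k] - a.hi[k], 0.0)
        span = max(a.hi[k] - b.lo[k], b.hi[k] - a.lo[k])
        dmin2 += gap * gap
        dmax2 += span * span
    return math.sqrt(dmin2), math.sqrt(dmax2)


def dual_tree_count(a, b, rmin, rmax):
    """Count pairs (p in a, q in b) with rmin <= |p - q| < rmax."""
    dmin, dmax = box_dist_bounds(a, b)
    if dmax < rmin or dmin >= rmax:
        return 0                      # exclusion: no pair can fall in the bin
    if dmin >= rmin and dmax < rmax:
        return a.n * b.n              # inclusion: every pair falls in the bin
    if not a.children and not b.children:
        return brute_count(a.points, b.points, rmin, rmax)
    # Otherwise split the larger node that still has children and recurse.
    if a.children and (a.n >= b.n or not b.children):
        return sum(dual_tree_count(c, b, rmin, rmax) for c in a.children)
    return sum(dual_tree_count(a, c, rmin, rmax) for c in b.children)


def brute_count(pts_a, pts_b, rmin, rmax):
    """Naive O(N_a * N_b) count, used at the leaves and as a cross-check."""
    return sum(1 for p in pts_a for q in pts_b
               if rmin <= math.dist(p, q) < rmax)


if __name__ == "__main__":
    random.seed(0)
    data = [(random.random(), random.random(), random.random()) for _ in range(500)]
    rand = [(random.random(), random.random(), random.random()) for _ in range(500)]
    print("DR count in bin:", dual_tree_count(Node(data), Node(rand), 0.1, 0.2))
```

The sketch counts cross pairs between two point sets (a DR-type count). An n-point count would, under the same assumptions, walk n trees together and apply the same exclusion and inclusion tests to each side of the configuration.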



3. Computing Correlation Functions 

Fig. 1. The architecture of the VOTechBroker and how it interacts with the Grid, VO and our statistical algorithms. The npt algorithm is a "Client" (at the bottom) and interacts with the "Broker" via a web-form (HTML) to define the basic parameters needed to run the algorithm and the resources needed. Eventually we plan to interact with the "Broker" via the AstroGrid workflow environment, allowing the submission of jobs as well as the storage of the input data and results in MySpace. There can be multiple "Clients" to the "Broker".

Even with an efficient algorithm, the computation of higher-order correlation functions is intensive. In detail, the n-point correlation functions require a large number of sequential calls to the npt code. These include computing the cross-correlation between the real data (called D) and a random dataset (called R), which is used to mimic the edge effects in the real data. As outlined in Szapudi & Szalay (1998), each estimation of a 3-point correlation function, for a given bin of triangular shape (i.e., $r_{12} \pm \Delta r_{12}$, $r_{23} \pm \Delta r_{23}$, $r_{13} \pm \Delta r_{13}$), requires seven separate source counts over the whole dataset, namely DDD, DDR, DRR, RRR, DD, RR, DR. Therefore, if one wished to probe 10^ triangle configurations, then ~ 10'^ sequential npt jobs are required. This can rise rapidly if one wishes to estimate errors on the n-point functions using either jack-knife resampling (i.e., removing subregions of the data and then recomputing the correlation functions), or a large ensemble of mock catalogs (derived from simulations). Such computations are well-suited to large clusters or grids of computers.
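
As an illustration of how such counts are combined, the sketch below writes down commonly quoted forms of the Landy & Szalay (1993) 2-point estimator and the Szapudi & Szalay (1998) 3-point estimator, together with a delete-one jack-knife error. It assumes that every count has already been normalised by the number of possible pairs or triplets and symmetrised over the triangle sides. The function names are ours, and the exact combination used in our analysis may differ.

```python
import math


def normalised_pair_count(raw_count, n_a, n_b=None):
    """Divide a raw pair count by the number of possible pairs.

    With one catalogue (n_b is None) pairs are unordered auto-pairs;
    with two catalogues they are cross pairs.
    """
    possible = n_a * (n_a - 1) / 2.0 if n_b is None else n_a * n_b
    return raw_count / possible


def landy_szalay_2pt(dd, dr, rr):
    """2-point estimate (DD - 2 DR + RR) / RR from normalised counts."""
    return (dd - 2.0 * dr + rr) / rr


def szapudi_szalay_3pt(ddd, ddr, drr, rrr):
    """3-point estimate (DDD - 3 DDR + 3 DRR - RRR) / RRR from normalised counts."""
    return (ddd - 3.0 * ddr + 3.0 * drr - rrr) / rrr


def jackknife_error(estimates):
    """Delete-one jack-knife error from estimates with one subregion removed each."""
    n = len(estimates)
    mean = sum(estimates) / n
    variance = (n - 1) / n * sum((x - mean) ** 2 for x in estimates)
    return math.sqrt(variance)


if __name__ == "__main__":
    # Made-up normalised counts, purely to exercise the functions.
    print(landy_szalay_2pt(dd=1.5, dr=1.0, rr=1.0))
    print(szapudi_szalay_3pt(ddd=2.0, ddr=1.2, drr=1.0, rrr=1.0))
    print(jackknife_error([0.48, 0.52, 0.50, 0.47, 0.53]))
```

Each jack-knife subregion needs its own full set of counts, which is why the number of npt jobs grows so quickly once errors are required.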

In recent years, we have used computational resources like TeraGrid (www.teragrid.org) and COSMOS (www.damtp.cam.ac.uk/cosmos/) to perform the computation of the n-point correlation functions for the SDSS main galaxy sample and the SDSS LRG sample. Our experience shows that the management and scheduling of such a large number of jobs on these massive machines is laborious and tedious. To ease this problem, we are working on VOTechBroker, a tool that joins two new and emerging technologies: the VO and computational grids.

4. VOTechBroker

AstroGrid (www.astrogrid.org) is a PPARC-funded
project to create a working Virtual Observatory for
UK and international astronomers. AstroGrid works 
closely with other VO initiatives around the world 
(via the International Virtual Observatory Alliance; 
IVOA) and is part of the Euro-VO initiative in Eu- 
rope. In particular, the work outlined here has been 
performed as part of the EU-funded VOTech project, 
which aims to complete the technical preparation 
work for the construction of a European Virtual Ob- 
servatory. Specifically, VOTech is undertaking R&D 
into data-mining and visualization tools, which can 
be integrated into the emerging VO and computa- 
tional grid infrastructure. Therefore, VOTech will 
build upon existing or emerging standards and in- 
frastructure (e.g. IVOA standards and AstroGrid 
middleware), as well as looking at standards from 
W3C and GGF. 

As part of the VOTech research, we are engaged in developing the VOTechBroker. The key design goals of the broker are to: i) Remove the execution and management of a large number of jobs (like npt) from the user in a transparent and reusable way; ii) Accommodate different grid infrastructures (e.g. Condor, Globus, etc.); iii) Locate suitable resources on the grid and optimize the submission of jobs; iv) Monitor the status and success of jobs; v) Combine with AstroGrid MySpace and workflow environments to allow easy management of job submission and final results (as well as utilizing other algorithms within the VO). In Figure 1, we show the schematic design of the broker architecture, which illustrates the modular and "plug-in" design philosophy we have adopted. This is needed because one of the key requirements of VOTechBroker is that it should be straightforward to add new algorithms, resources and middleware (e.g. a different job submission tool or protocol).
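
The "plug-in" philosophy can be sketched in a few lines of Python. This is only an illustration of the design goals listed above and not the VOTechBroker code: the classes `Job`, `Backend`, `DryRunBackend` and `Broker`, and the round-robin placement of jobs, are all assumptions made for the example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Job:
    """One unit of work, e.g. a single call to an algorithm such as npt."""
    executable: str
    arguments: List[str] = field(default_factory=list)
    status: str = "PENDING"


class Backend(ABC):
    """Interface every middleware plug-in (Condor, Globus, ...) would implement."""

    @abstractmethod
    def submit(self, job: Job) -> str:
        """Submit a job and return an identifier for it."""

    @abstractmethod
    def status(self, job_id: str) -> str:
        """Return the current status of a previously submitted job."""


class DryRunBackend(Backend):
    """Hypothetical plug-in that records jobs instead of running them."""

    def __init__(self, name: str):
        self.name = name
        self.jobs: Dict[str, Job] = {}

    def submit(self, job: Job) -> str:
        job_id = f"{self.name}-{len(self.jobs)}"
        job.status = "DONE"            # a real plug-in would start the job here
        self.jobs[job_id] = job
        return job_id

    def status(self, job_id: str) -> str:
        return self.jobs[job_id].status


class Broker:
    """Hides submission and monitoring behind one interface."""

    def __init__(self):
        self.backends: List[Backend] = []
        self.placement: Dict[str, Backend] = {}

    def register(self, backend: Backend) -> None:
        self.backends.append(backend)

    def submit_all(self, jobs: List[Job]) -> List[str]:
        ids = []
        for i, job in enumerate(jobs):
            backend = self.backends[i % len(self.backends)]  # naive round-robin
            job_id = backend.submit(job)
            self.placement[job_id] = backend
            ids.append(job_id)
        return ids

    def monitor(self) -> Dict[str, str]:
        return {jid: b.status(jid) for jid, b in self.placement.items()}


if __name__ == "__main__":
    broker = Broker()
    broker.register(DryRunBackend("condor"))
    broker.register(DryRunBackend("ngs"))
    broker.submit_all([Job("npt", [f"bin{i}"]) for i in range(4)])
    print(broker.monitor())
```

Adding a new resource then amounts to writing one more `Backend` subclass, which is the property the design goals ask for. A real broker would replace the round-robin rule with resource discovery and load-aware scheduling.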

Fig. 2. Using CMBfast, we have varied $\Omega_b$ (baryon fraction) and determined which models lie within the 95% confidence ball around $f(X_i)$. For this illustration, we have kept all other parameters in these CMBfast models fixed at their fiducial values. The gray models are within the confidence ball, while the others are outside the ball, indicating they are "bad fits" to the data (at the 95% confidence level). We get an allowed range of $0.0169 < \Omega_b < 0.0287$.

We have implemented the core functionality of VOTechBroker and are presently testing it by submitting ~ 10^ npt jobs on the UK National Grid Service (www.ngs.ac.uk), the COSMOS supercomputer and a local Condor pool of machines. The key ingredients of the present VOTechBroker include GridSAM (an open-source job submission and monitoring web service from the London e-Science Centre), the UK e-Science X.509 certificates, MyProxy (a repository for X.509 Public Key Infrastructure security credentials), and the Job Submission Description Language (JSDL; a standard from the Global Grid Forum for describing job execution requirements to a range of resource managers). At present, the VOTechBroker provides a web-form interface to just the npt algorithm discussed above, but it is modular in design so other algorithms can be easily added via other web forms. Results from the VOTechBroker will soon be placed in a user's AstroGrid MySpace. In the near future, we will interface the broker with other computational resources, e.g., TeraGrid (see below), and the AstroGrid workflow.
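
For readers unfamiliar with JSDL, the sketch below builds a minimal job description with Python's standard `xml.etree.ElementTree` module. The element names and namespaces follow our understanding of the JSDL 1.0 specification and its POSIX application extension. The executable path, arguments and output file are hypothetical placeholders and do not reflect the real npt command line or the documents that VOTechBroker generates.

```python
import xml.etree.ElementTree as ET

# Namespaces as we understand them from the JSDL 1.0 specification.
JSDL = "http://schemas.ggf.org/jsdl/2005/11/jsdl"
POSIX = "http://schemas.ggf.org/jsdl/2005/11/jsdl-posix"


def make_jsdl(job_name, executable, arguments, stdout):
    """Return a JSDL job description as an XML string."""
    ET.register_namespace("jsdl", JSDL)
    ET.register_namespace("jsdl-posix", POSIX)

    root = ET.Element(f"{{{JSDL}}}JobDefinition")
    desc = ET.SubElement(root, f"{{{JSDL}}}JobDescription")

    ident = ET.SubElement(desc, f"{{{JSDL}}}JobIdentification")
    ET.SubElement(ident, f"{{{JSDL}}}JobName").text = job_name

    app = ET.SubElement(desc, f"{{{JSDL}}}Application")
    posix = ET.SubElement(app, f"{{{POSIX}}}POSIXApplication")
    ET.SubElement(posix, f"{{{POSIX}}}Executable").text = executable
    for arg in arguments:
        ET.SubElement(posix, f"{{{POSIX}}}Argument").text = arg
    ET.SubElement(posix, f"{{{POSIX}}}Output").text = stdout

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical executable path and arguments, for illustration only.
    print(make_jsdl("npt-bin-0001", "/path/to/npt",
                    ["data.txt", "random.txt"], "npt-bin-0001.out"))
```

One such document per job is the kind of input a JSDL-aware submission service expects, so a broker can generate them in bulk from the parameters entered on a web form.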

5. Nonparametric Statistics 

In addition to the need for new statistical software that scales up to petabyte datasets, we also require new algorithms and computational resources that exploit the emerging power of nonparametric statistics. As discussed in Wasserman et al. (2001), nonparametric methods are statistical techniques that make as few assumptions as possible about the process that generated the data. Such methods are more flexible than traditional parametric methods, which impose rigid and often unrealistic assumptions. With large sample sizes, nonparametric methods make it possible to find subtle effects which might otherwise be obscured by the assumptions built into parametric methods.

In Genovese et al. (2004), we discuss the application of nonparametric techniques to the analysis of the power spectrum of anisotropies in the Cosmic Microwave Background (CMB). For example, one can ask the simple question: How many peaks are detected in the WMAP CMB power spectrum? This question is hard to answer using parametric models for the CMB (e.g. CMBfast models) as these models possess multiple peaks and troughs, which could potentially be fit to noise rather than real peaks in the data. To solve this, we have performed a nonparametric analysis of the WMAP power spectrum (Miller et al. 2003), which involves explaining the observed data ($Y_i$) as $Y_i = f(X_i) + c_i$, where $f(X_i)$ is an orthogonal function (expanded as a cosine basis $\beta_i \cos(i\pi X_i)$) and $c_i$ is the covariance matrix. The challenge is to "shrink" $f(X_i)$ to keep the number of coefficients ($\beta_i$) to a minimum. We achieve this using the method of Beran (2000), where the number of coefficients kept is equal to the number of data points. This is optimal for all smooth functions and provides valid confidence intervals. We also use monotonic shrinkage of the coefficients, specifically the nested subset selection (NSS). The main advantage of this methodology is that it provides a "confidence ball" (in N dimensions) around $f(X_i)$, allowing nonparametric inferences like: Is the second peak in the WMAP power spectrum detected? In addition, we can test parametric models against the "confidence ball", thus quickly assessing the validity of such models in N dimensions. This is illustrated in Figure 2.
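
The cosine-basis expansion and nested subset selection can be illustrated with a small, simplified sketch. It assumes equally spaced points on [0, 1], independent noise of known variance `sigma2`, and a textbook risk estimate for choosing how many leading coefficients to keep. The WMAP analysis uses the full covariance of the data, so this is not the procedure of Miller et al. (2003) or Genovese et al. (2004); the function names are ours.

```python
import math


def basis(j, x):
    """Orthonormal cosine basis on [0, 1]: 1 for j = 0, sqrt(2) cos(pi j x) otherwise."""
    return 1.0 if j == 0 else math.sqrt(2.0) * math.cos(math.pi * j * x)


def cosine_coefficients(y):
    """Empirical coefficients Z_j = (1/n) sum_i y_i phi_j(x_i), with x_i = (i + 0.5) / n."""
    n = len(y)
    return [sum(y[i] * basis(j, (i + 0.5) / n) for i in range(n)) / n
            for j in range(n)]


def nss_cutoff(z, sigma2):
    """Number J of leading coefficients to keep under nested subset selection.

    Minimises the risk estimate J * v + sum_{j >= J} max(z_j^2 - v, 0),
    where v = sigma2 / n is the variance of each empirical coefficient.
    """
    n = len(z)
    v = sigma2 / n
    risks = [J * v + sum(max(zj * zj - v, 0.0) for zj in z[J:])
             for J in range(n + 1)]
    return min(range(n + 1), key=risks.__getitem__)


def reconstruct(z, keep):
    """Evaluate the fit at the data points using only the first `keep` coefficients."""
    n = len(z)
    return [sum(z[j] * basis(j, (i + 0.5) / n) for j in range(keep))
            for i in range(n)]


if __name__ == "__main__":
    n = 32
    xs = [(i + 0.5) / n for i in range(n)]
    y = [1.0 + 2.0 * math.cos(2.0 * math.pi * x) for x in xs]
    z = cosine_coefficients(y)
    print("coefficients kept:", nss_cutoff(z, sigma2=1e-4))
```

Keeping the first J coefficients and zeroing the rest is the simplest monotone shrinkage scheme. The confidence ball is then a region around the shrunken coefficient vector whose radius comes from the distribution of the risk estimate, which the sketch does not attempt to compute.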

6. Massive Model Testing 

We are embarked on a major effort to jointly search the 7-dimensional cosmological parameter space of $\Omega_m$, $\Omega_{DE}$, $\Omega_b$, $\tau$, neutrino fraction, spectral index and $H_0$ using parametric models created by CMBfast, and thus determine which of these models fit within the confidence ball around our $f(X_i)$ at the 95% confidence limit. Traditionally, this is done by marginalising over the other parameters to gain confidence intervals on each parameter separately. This is a problem in high dimensions, where the likelihood function can be degenerate, ill-defined and under-identified. Unfortunately, the nonparametric approach is computationally intensive, as millions of models need to be searched, each of which takes ~ 3 minutes to run.



To mitigate this problem, we have developed an intelligent method for searching for the surface of the confidence ball in high dimensions based on kriging. Briefly, kriging is a method of interpolation which predicts unknown values from data observed at known locations (it is also known as Gaussian process regression, which is a form of Bayesian inference in statistics). There are many different metrics for evaluating the kriging success; we use here the "Straddle" method, which picks new test points based both on their overall distance from previously searched points and on being predicted to be close to the boundary of the confidence ball. We have also developed a heuristic algorithm for searching for "missed peaks" in the likelihood space by searching models along the path joining previously detected peaks. We find no "missed peaks", which illustrates that our kriging algorithm is effective in finding the surface of the confidence ball in this high-dimensional space.
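
A toy version of this search can be written with a small Gaussian process regression and a straddle score. The sketch assumes a zero-mean Gaussian process with a squared-exponential kernel of fixed length scale, and the common form of the score, 1.96 times the predictive standard deviation minus the distance of the predicted value from the threshold. The kernel, its length scale, the threshold and the one-dimensional test function are all illustrative choices and not those of our actual search.

```python
import math


def kernel(a, b, length=0.2):
    """Squared-exponential covariance between two points (tuples of floats)."""
    d2 = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-0.5 * d2 / length ** 2)


def cholesky(A):
    """Lower-triangular L with L L^T = A, for a symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(A[i][i] - s) if i == j else (A[i][j] - s) / L[j][j]
    return L


def forward_sub(L, b):
    """Solve L y = b for lower-triangular L."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y


def backward_sub(L, y):
    """Solve L^T x = y for lower-triangular L."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def gp_predict(X, y, x_new, jitter=1e-6):
    """Predictive mean and standard deviation of a zero-mean GP at x_new."""
    K = [[kernel(a, b) + (jitter if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    L = cholesky(K)
    alpha = backward_sub(L, forward_sub(L, y))
    ks = [kernel(a, x_new) for a in X]
    mu = sum(k * a for k, a in zip(ks, alpha))
    v = forward_sub(L, ks)
    var = max(kernel(x_new, x_new) - sum(vi * vi for vi in v), 0.0)
    return mu, math.sqrt(var)


def straddle(mu, sigma, threshold):
    """High when the prediction is uncertain and close to the threshold."""
    return 1.96 * sigma - abs(mu - threshold)


def next_point(X, y, candidates, threshold):
    """Candidate with the largest straddle score."""
    return max(candidates,
               key=lambda c: straddle(*gp_predict(X, y, c), threshold))


if __name__ == "__main__":
    X = [(0.0,), (0.25,), (0.5,), (0.75,), (1.0,)]
    y = [40.0 * (x[0] - 0.5) ** 2 for x in X]      # toy misfit statistic
    grid = [(i / 100.0,) for i in range(101)]
    print("next model to run:", next_point(X, y, grid, threshold=4.0))
```

Each new model evaluation is added to `X` and `y` and the selection is repeated, so that expensive model runs concentrate near the predicted boundary instead of being spread evenly over the parameter space.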

We have distributed the CMBfast model computations over a local Condor pool of computers. In Figure 3, we show preliminary results from this high-dimensional search for the surface of the confidence ball and present joint 2D confidence limits on pairs of the aforementioned cosmological parameters. These calculations represent 6.8 years of CPU time to calculate over one million CMBfast models. In the near future, we will move this analysis to TeraGrid, using VOTechBroker, and plan 10 million models to fully map the surface of the confidence ball. We will also make available a Java-based web service for accessing these models, and the WMAP confidence ball, thus allowing other users to rapidly combine their data with our WMAP constraints, e.g., doing a joint constraint from LSS and CMB data. We are also working on possible convergence tests, and visualization tools within VOTech, to access this high-dimensional data.
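
The scale of these numbers is easy to check. The short sketch below converts a number of models into CPU time at about 3 minutes per model. The figure of 1.2 million models is an assumption consistent with "over one million", and the pool size of 500 CPUs is purely illustrative.

```python
MINUTES_PER_YEAR = 60 * 24 * 365.25


def cpu_years(n_models, minutes_per_model=3.0):
    """Total CPU time, in years, to run n_models models."""
    return n_models * minutes_per_model / MINUTES_PER_YEAR


def wall_clock_days(n_models, n_cpus, minutes_per_model=3.0):
    """Elapsed days if the models are spread evenly over n_cpus processors."""
    return n_models * minutes_per_model / n_cpus / (60 * 24)


if __name__ == "__main__":
    print(f"1.2 million models: {cpu_years(1.2e6):.1f} CPU years")
    print(f"10 million models:  {cpu_years(1.0e7):.1f} CPU years")
    print(f"10 million models on 500 CPUs: {wall_clock_days(1.0e7, 500):.0f} days")
```

Under these assumptions the first line reproduces the 6.8 CPU years quoted above, and the planned 10 million models correspond to roughly 57 CPU years, which is why a grid-scale resource is needed.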

7. Summary 

The two examples given here (massive model testing of the WMAP data using nonparametric statistics, and higher-order correlation functions of SDSS galaxies) represent a growing trend in astrophysics and cosmology for massive statistical computations. Our plan is to develop the VOTechBroker to provide a powerful framework within which such massive astronomical analyses can be performed. As discussed, the main goals of the VOTechBroker are to abstract from the user (either a person or another program) the complexities of job submission and management on computational grids, as well as to offer a modular "plug-in" design so other algorithms and software can be easily added. Finally, we plan to integrate VOTechBroker into the AstroGrid workflow and MySpace environments, so it becomes a natural repository for a host of advanced statistical algorithms that scale up in preparation for petabyte-scale datasets and analyses.

Fig. 3. The results of our 7-dimensional parameter search using 1.2 million models from CMBfast. The light blue (or lightest shading for greyscales) color are models excluded at the 34% level. The purple (or mid-grade shading) are models excluded by the 68% confidence ball and the red is the 95% confidence ball.